One gene may be transcribed into pre-mRNA and then spliced into multiple transcripts called isoforms. Which isoforms are produced may be modulated by splicing factors within the cell [1].
RNA-seq takes a snapshot of a cell’s gene expression profile at the time of sequencing. Short-read sequencing (100-250 base pairs) requires estimating which isoform a read came from, since the read may be contained within a single exon. Long-read sequencing may span multiple exons, giving more confidence in the detected transcript isoform at the expense of read depth.
Goal: Develop an isoform-grouping method to facilitate isoform-level differential expression (DE) analysis of long-read sequencing data while controlling false positives.
A gene with \(N\) isoforms implies \(N-1\) inner nodes for its associated tree. For each sample, the count at an inner node is the sum of the counts of the leaves beneath it.
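The aggregation of leaf counts into inner-node counts can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `inner_node_counts`, the dictionary-based tree encoding, and the toy counts are all assumptions made for the example.

```python
import numpy as np

def inner_node_counts(leaf_counts, children):
    """Extend per-sample leaf counts with counts for every inner node.

    leaf_counts: dict mapping leaf (isoform) name -> 1-D array of per-sample counts.
    children:    dict mapping inner node name -> list of child node names
                 (children may be leaves or other inner nodes).
    Both encodings are hypothetical; any tree structure would do.
    """
    counts = {name: np.asarray(c) for name, c in leaf_counts.items()}

    def total(node):
        # An inner node's count is the sum of its children's counts,
        # which equals the sum over all leaves beneath it.
        if node not in counts:
            counts[node] = sum(total(child) for child in children[node])
        return counts[node]

    for node in children:
        total(node)
    return counts

# Toy gene with N = 3 isoforms, hence N - 1 = 2 inner nodes, and two samples.
leaves = {"t1": [5, 3], "t2": [2, 7], "t3": [1, 0]}
tree = {"n1": ["t1", "t2"], "root": ["n1", "t3"]}
extended = inner_node_counts(leaves, tree)
print(extended["n1"], extended["root"])
```

The resulting dictionary holds one count vector per leaf and per inner node, which is the shape of input a per-node differential expression test would need.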
Once we have our extended data, we run DESeq2 [2] and evaluate the resulting p-values with our tree-climbing algorithm.
To generate hierarchical clusters, we defined a similarity metric based on the exons shared between transcripts, as opposed to using data-dependent counts. Let \(G\) represent a set of isoforms of size \(g\). Given any indices \(i\), \(j \le g\), define \(G_i\) and \(G_j\) as isoforms \(i\) and \(j\) from \(G\), represented as sets of exons of size \(N\) and \(M\) respectively. For any two \(i\) and \(j\), we define the similarity as:
\[ S_{ij}(G_i,G_j) = \frac{2\sum_{n=1}^{N}\sum_{m=1}^{M}J(G_{i_n}, G_{j_m})}{N + M} \hspace{1cm} J(G_{i_n}, G_{j_m}) = \frac{|G_{i_n}\cap G_{j_m}|}{|G_{i_n}\cup G_{j_m}|} \]
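A minimal sketch of this similarity follows. It assumes each exon is a half-open genomic interval `(start, end)`, so that the Jaccard index \(J\) becomes overlap length divided by union length. That interval representation, the function names, and the example coordinates are assumptions for illustration, not details given above.

```python
def jaccard(exon_a, exon_b):
    """Jaccard index of two exons given as half-open (start, end) intervals."""
    overlap = max(0, min(exon_a[1], exon_b[1]) - max(exon_a[0], exon_b[0]))
    union = (exon_a[1] - exon_a[0]) + (exon_b[1] - exon_b[0]) - overlap
    return overlap / union if union > 0 else 0.0

def isoform_similarity(iso_i, iso_j):
    """S_ij = 2 * sum_n sum_m J(exon_n, exon_m) / (N + M)."""
    total = sum(jaccard(e, f) for e in iso_i for f in iso_j)
    return 2.0 * total / (len(iso_i) + len(iso_j))

# Hypothetical isoforms: iso_b skips the second exon of iso_a.
iso_a = [(0, 100), (200, 300)]
iso_b = [(0, 100)]
print(isoform_similarity(iso_a, iso_a))  # identical, non-overlapping exons
print(isoform_similarity(iso_a, iso_b))  # one shared exon out of three in total
```

With non-overlapping exons inside each isoform, identical isoforms score 1 and isoforms with no exon overlap score 0. A distance such as \(1 - S_{ij}\) could then be passed to a hierarchical clustering routine, though the linkage used is not specified here.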
Choose an inner node within the tree and shift the mean of all of its leaves by some delta for one group of samples. We can then evaluate our tree-climbing algorithm by how accurately it selects the known perturbed nodes.
Let \(\mu_{ij}=\mu_0 + \delta_{ij}\) with \(\mu_0 = 10\), where \(\delta_{ij} = 0\) for \(i,j\) in the control group and \(\delta_{ij} = \hat \delta\) for \(i,j\) in the affected group. Each entry \(X_{ij}\) is sampled as follows: \[ X_{ij} \sim \text{Nbinom}(\mu_{ij},\ \alpha = 100) \]
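The sampling step can be sketched with NumPy as below. One assumption should be read carefully: \(\alpha\) is treated as the negative binomial size parameter (as in R's `rnbinom(mu = , size = )`), so that \(p = \alpha / (\alpha + \mu_{ij})\) gives mean \(\mu_{ij}\). If \(\alpha\) is instead meant as a dispersion, the size would be \(1/\alpha\). The function name, argument names, and the leaf-by-sample layout are also hypothetical.

```python
import numpy as np

def simulate_counts(n_leaves, n_control, n_affected, perturbed, delta,
                    mu0=10.0, size=100.0, seed=0):
    """Sample a (leaves x samples) count matrix with a mean shift.

    perturbed: indices of leaves under the chosen inner node.
    size:      ASSUMED to be the 'alpha' of the text, read as NB size.
    """
    rng = np.random.default_rng(seed)
    n_samples = n_control + n_affected
    mu = np.full((n_leaves, n_samples), mu0)
    # Shift the mean only for perturbed leaves in the affected samples.
    affected_cols = np.arange(n_control, n_samples)
    mu[np.ix_(perturbed, affected_cols)] += delta
    # NumPy's parameterization: mean = size * (1 - p) / p = mu.
    p = size / (size + mu)
    return rng.negative_binomial(size, p)

X = simulate_counts(n_leaves=4, n_control=3, n_affected=3,
                    perturbed=[0, 1], delta=3.0)
print(X.shape)
```

Setting `delta=0.0` produces data in which no leaf differs between groups.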
We conducted 50 simulations per \(\hat\delta \in \{0.5, 1, 1.5, \dots, 6\}\) and evaluated how often our tree-climbing algorithm correctly merged the known perturbed data. Merged nodes with \(\delta_{ij}=0\) are considered false positives, and unmerged nodes with \(\delta_{ij}=\hat\delta\) are considered false negatives in the tree-climbing context.
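As a small illustration of these two definitions, the sketch below compares the set of nodes an algorithm merged against the set of nodes that were truly perturbed. The function name and node labels are hypothetical, and the merging step itself is not shown.

```python
def merge_errors(merged, perturbed):
    """Return (false_positives, false_negatives) as sets of node names.

    merged:    nodes the tree-climbing step merged (hypothetical input).
    perturbed: nodes simulated with a nonzero shift.
    """
    merged, perturbed = set(merged), set(perturbed)
    false_positives = merged - perturbed   # merged although delta = 0
    false_negatives = perturbed - merged   # perturbed but left unmerged
    return false_positives, false_negatives

fp, fn = merge_errors(merged={"t1", "t2", "t4"}, perturbed={"t1", "t2", "t3"})
print(sorted(fp), sorted(fn))
```

Averaging the sizes of these sets over repeated simulations at each \(\hat\delta\) would give per-shift error rates.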
At higher \(\hat\delta\) shifts, the algorithm identifies the correct inner node in our simulations, but we also need to consider the case of the null hypothesis. To assess the distribution of p-values under the assumption that there is no difference in the \(\mu_{ij}\), we ran simulations with \(\hat\delta = 0\).
There is an enrichment of low p-values among the inner nodes in the null hypothesis simulations. To correct this, we can apply count splitting, a method formulated for experiments that use the same data for feature selection and for analysis [3].
\[ \begin{aligned} X_{ij} &\sim \text{Nbinom}(\mu_{ij},\ \alpha = 100)\\ X_{ij}^{\text{train}} &\sim \text{Bin}(X_{ij},\ \theta = 0.5)\\ X_{ij}^{\text{test}} &= X_{ij} - X_{ij}^{\text{train}} \end{aligned} \]
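The split itself is a binomial thinning of each count, which can be sketched as follows. The function name `count_split` and the toy matrix are assumptions for the example. How the train and test halves are then used (for instance, one to choose nodes and the other to test them) is not specified above and is left out.

```python
import numpy as np

def count_split(X, theta=0.5, seed=0):
    """Split a count matrix into train and test parts by binomial thinning."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    # Each count X_ij is thinned independently: train ~ Bin(X_ij, theta).
    train = rng.binomial(X, theta)
    # The remainder goes to the test split, so train + test == X exactly.
    test = X - train
    return train, test

X = np.array([[10, 0], [3, 25]])
X_train, X_test = count_split(X)
print(X_train + X_test)
```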